Skip to content

BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels #53637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 13, 2023

Conversation

rhshadrach
Copy link
Member

@rhshadrach rhshadrach added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jun 13, 2023
@rhshadrach rhshadrach requested a review from mroeschke June 13, 2023 14:57
@mroeschke mroeschke added this to the 2.1 milestone Jun 13, 2023
@mroeschke mroeschke merged commit 5edc2cc into pandas-dev:main Jun 13, 2023
@mroeschke
Copy link
Member

Nice! Thanks @rhshadrach

@rhshadrach rhshadrach deleted the stack_sort_bug branch June 13, 2023 18:10
Comment on lines +2040 to +2050
expected_index = MultiIndex(
levels=[[0, 1, 2, 3], [0, 1]],
codes=[[1, 1, 0, 0, 2, 2, 3, 3], [1, 0, 1, 0, 1, 0, 1, 0]],
)
expected = DataFrame(
{
0: [0, 1, 0, 1, 0, 1, 0, 1],
1: [2, 3, 2, 3, 2, 3, 2, 3],
},
index=expected_index,
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke - I messed up the order of the codes here, it should be [0, 1, 0, 1...] for the 2nd level. I believe this PR didn't actually change any behavior.

In attempting to fix this, I'm finding myself questioning what sort=True should be doing in stack. Should it be equivalent to just calling result.sort_index(), or should it be only sorting the levels that are added to the index from the columns? The latter would be equivalent to:

result = result.sort_index(level=np.arange(self.index.nlevels, result.index.nlevels))

Either one of these breaks ~20 tests in tests/frame/test_stack_unstack.py. As far as I can tell, only some code paths through DataFrame.stack do sorting at all.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the most predictable result is for sort=True to be equivalent to result.sort_index instead of only sorting the added levels.

When adding the sort keyword I was more in the mindset of sort=False per #15105 of ensuring added level didn't sort the entire result so I'm pretty new to contemplating what sort=True should truly do.

Copy link
Member Author

@rhshadrach rhshadrach Jun 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - digging through that issue helped. It seems like the issue wasn't necessarily the order of the values in the MultiIndex (which is a combination of the order of the levels and the associated codes), but rather the order of the levels themselves. I'm finding that sort=True/False only controls the order of the levels (and not the values). I'm going to think on this a bit more - it's becoming hard to introduce #53515 with this behavior (even before the introduction of sort=False).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understanding that sort=True isn't supposed to sort the values helped; I'm going to revert this PR.

rhshadrach added a commit that referenced this pull request Jun 21, 2023
mroeschke pushed a commit that referenced this pull request Jun 21, 2023
…evels" (#53760)

Revert "BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels (#53637)"

This reverts commit 5edc2cc.
punndcoder28 pushed a commit to punndcoder28/pandas that referenced this pull request Jun 21, 2023
…evels" (pandas-dev#53760)

Revert "BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels (pandas-dev#53637)"

This reverts commit 5edc2cc.
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…andas-dev#53637)

* BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels

* revert ignoring warnings
Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
…evels" (pandas-dev#53760)

Revert "BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels (pandas-dev#53637)"

This reverts commit 5edc2cc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrame.stack with sort=True and unsorted MultiIndex levels
2 participants